--- title: "Regression" author: "Oliver" date: ' ' output: word_document: default html_document: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` This document shows how to use **R/Studio** to create output that is similar to that shown in Lock^5 (2nd Ed.). ## Section 9.1 In this section we look at inference for slope and correlation. To do so we need to import some data. We start by looking at the cricket chirp data from the book's data repository. ```{r, echo=TRUE} ### Read the CSV file crickets = read.csv("http://facweb1.redlands.edu/fac/jim_bentley/Data/Lock5Ed2/Lock5Data2eCSV/CricketChirps.csv") ### See what is in the "crickets" dataframe to see if it worked crickets ``` The above seven observations look to be the same as those in the original CSV file. Suppose that we are interested in predicting ***Temperature*** from the number of ***Chirps*** that the crickets generate. We can plot the data to see if the relationship appears to be linear. ```{r, echo=TRUE} ### Load the lattice package to make plotting easy library(lattice) ### Now plot the points. Note the use of ~ instead of = in the formula xyplot(Temperature ~ Chirps, data=crickets) ``` Although there are not many points, the relationship appears to be fairly linear. R makes getting the regression equation easy. ```{r, echo=TRUE} ### Get the regression equation by fitting the linear model (lm) lm(Temperature ~ Chirps, data=crickets) ### Save the linear model for later use crickets.lm = lm(Temperature ~ Chirps, data=crickets) ### Check to see if it was saved crickets.lm ### Plot the points and the regression line using p and r for points and regression xyplot(Temperature ~ Chirps, data=crickets, type=c("p","r")) ``` The model is Temperature = 37.677 + 0.231(Chirps). What is a 95% confidence interval for $\beta_1$? Is the slope significantly different from zero? That is, does ***Chirps*** add anything to the guess of ***Temperature***? We can let R compute the standard error and t-value that are needed to test $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \ne 0$. ```{r} ### Check that the linear model fitted above is still around crickets.lm ### Now get a summary that contains the coefficients and standard errors summary(crickets.lm) ``` Note that the line that starts with ***Chirps*** contains an estimate (0.23067) and a standard error (0.01423). These values can be used to create a confidence interval for $\beta_1$. We note that the degrees of freedom for the t-distribution are $df = 7-2 = 5$. We now compute a 95% confidence interval using $b_1 \pm t^*_{n-2} \cdot SE$ ```{r} ### Compute t-star for 5%/2 = 2.5% to the right with df=5 qt(0.025, 5, lower.tail=FALSE) ### Compute the endpoints using the information found above 0.23067 - 2.57058 * 0.01423 0.23067 + 2.57058 * 0.01423 ``` So, it appears that we can be 95% sure that the slope of the population model, $\beta_1$, is in the interval (0.194, 0.267). That is, we are 95% sure that the slope of the model to predict temperature from cricket chirp rate is between 0.194 and 0.267 degrees F per chirp. We now test to see if $\beta_1$ is equal to zero. ```{r} ### Compute t and the degrees of freedom (0.23067-0)/0.01423 7-2 ### Get the p-value for a two-sided alternative using the right tail of the t distribution for t = 16.2101 pt(16.2101, 5 , lower.tail=FALSE)*2 ``` From the above we see that the t-value is $t = (0.23067 - 0)/0.01423 = 16.2101$ which has $df = 7-2 = 5$ degrees of freedom. This compares well with R's value of 16.21. Using either StatKey or R we can find the p-value by finding the area above 16.21 when there are 5 degrees of freedom. Above we see that R computes the p-value as 0.0000163. This is the same p-value presented in the table under Pr(>|t|) and indicates that We have very strong evidence that slope is different from zero, and that temperature is related to cricket chirp rates. R also computes the correlation between two variables. ```{r} ### Compute the correlation between temperature and chirps cor(crickets$Temperature, crickets$Chirps) ### Cheat and find the correlation between all variables in the dataframe cor(crickets) ``` To test $H_0: \rho = 0$ versus $H_A: \rho \ne 0$ we compute the t statistic $$ t = \frac{r - 0}{\sqrt{\frac{1 - r^2}{n-2}}} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$ which has an approximate t-distribution with $n-2$ degrees of freedom. For the cricket data this is ```{r} ### Compute the sample correlation, r cor(crickets$Temperature, crickets$Chirps) ### Compute t (0.9906249*sqrt(7-2))/sqrt(1-0.99062^2) ``` This value is the same (within rounding) as the t-statistic computed using the slope. The p-value will be the same as well. R computes these values for us more easily using the **cor.test** function. ```{r} ### Use cor.test to get the t-stat and p-value cor.test(crickets$Temperature, crickets$Chirps) ``` It is easy to get the **coefficient of determination**, $R^2 = r^2$, using R. ```{r} ### Find the correlation, r r = cor(crickets$Temperature, crickets$Chirps) r ### Now square it to get R-squared r^2 ``` For the cricket data, ***Chirps*** explains about 98.1% of the variability in ***Temperature*** via the regression line. Stated another way, the proportion of variability in temperature that is explained by the model based on chirp rate is about 0.981. Note that this value corresponds to the "Multiple R-squared" value from the summary table above. We should check the residuals to make sure that the assumptions that are made when fitting a regression model have been satisfied. Plots of the residuals help with this. ```{r} ### Replot the data with the regression line xyplot(Temperature ~ Chirps, data=crickets, type=c("p","r")) ### Plot the residuals against the explanatory variable plot(crickets$Chirps, resid(crickets.lm)) abline(h=0, lty=2) ### Plot the normal probability plot qqnorm(resid(crickets.lm)) qqline(resid(crickets.lm)) ``` With only seven observations it is difficult to say that the assumptions have not been satisfied. There do not appear to be any serious outliers or influential observations. We also do not see any obvious departure from normality.